We consider the problem of detection and localization of a small
block of weak activation in a large matrix, from a small number of
noisy, possibly adaptive, compressive (linear) measurements. This is
closely related to the problem of compressed sensing, where the task
is to estimate a sparse vector using a small number of linear
measurements. However, contrary to results in compressed sensing, where it
has been shown that neither adaptivity nor contiguous structure help
much, we show that in our problem the magnitude of the weakest
signals one can reliably localize is strongly influenced by both
structure and the ability to choose measurements adaptively. We derive
tight upper and lower bounds for the detection and estimation
problems, under both adaptive and non-adaptive measurement schemes.
We characterize the precise tradeoffs between the various problem
parameters, the signal strength and the number of measurements
required to reliably detect and localize the block of activation.

1. Introduction. Compressive measurements provide a very efficient 
means of recovering data vectors that are sparse in some basis or frame. 
Specifically, several papers, including Candes and Tao (2006), Donoho (2006), 
Candes and Tao (2007), Candes and Wakin (2008), and Wainwright (2009a) 
have shown that it is possible to recover a k-sparse vector in n dimensions
using only k log n compressive measurements, instead of measuring all of the
n coordinates. Motivated by this line of research, there have been recent at- 
tempts (Baraniuk et al., 2010, Soni and Haupt, 2011) at characterizing the 
number of compressive measurements needed to recover vectors that are 
endowed with some structure in addition to sparsity. Yet another exten- 
sion of the compressed sensing framework has been to attempt to recover 
vectors from few, possibly adaptive, compressed measurements, where
subsequent measurements are designed based on past observations (see, e.g.,
Candes and Davenport, 2011). Finally, there has also been work on detec- 
tion, instead of recovery, of sparse vectors from compressive measurements 
(Arias-Castro, 2012). However, almost all of this work has been focused on 
recovery or detection of (structured or unstructured) sparse data vectors 
from (passive or adaptive) compressed measurements. 

In this paper, we extend the compressed sensing paradigm to handle data 
matrices. In the unstructured case, the treatment of data matrices is ex- 
actly equivalent to the treatment of data vectors. The setting where data 
matrices are distinct from data vectors is when the sparsity pattern is struc- 
tured in a way that reflects some coupling between the rows and columns. 
We consider one such setup where there is a sub-matrix or block of acti- 
vation embedded in the data matrix. This is a natural model for several 
real-world activations such as when we have a group of genes (belonging to 
a common pathway for instance) co-expressed under the influence of a set of 
similar drugs (Yoon et al., 2005), when we have groups of patients exhibit- 
ing similar symptoms (Moore et al., 2010), when we have sets of malware 
with similar signatures (Jang et al., 2011), etc. However, in many of these 
applications, it is difficult to measure, compute or store all the entries of the 
data matrix. For example, measuring expression levels of all genes under all 
possible drugs is expensive, or recording the signatures of each individual 
malware is computationally demanding as it might require stepping through 
the entire malware code. However, if we have access to linear combinations 
of matrix entries (i.e. compressive measurements) such as combined expres- 
sion of multiple genes under the influence of multiple drugs then we might 
need to only make and store few such measurements, while still being able 
to infer the existence or location of the activated block of the data matrix. 
Thus, the goal is to detect or recover the activated block (set of co-expressed 
genes and drugs or malwares with similar signatures) using only few com- 
pressive measurements of the data matrix, instead of observing the entire 
data matrix directly. We consider both the passive (non-adaptive) and ac- 
tive (adaptive) measurements. The non-adaptive measurements are random 
or pre-specified linear combinations of matrix entries. In other cases, such 
as mixing drugs, we might be able to adapt the measurement process and 
sequentially design linear combinations that are more informative. 

Summary of our contributions. Using information theoretic tools, we 
establish lower bounds on the minimum number of compressive measure- 
ments and the weakest signal-to-noise ratio (SNR) needed to detect the 
presence of an activated block of positive activation, as well as to localize 
the activated block, using both non-adaptive and adaptive measurements. 

Table 1
Summary of main findings under the assumption that $n_1 = n_2 = n$ and $k_1 = k_2 = k$,
where the size of the matrix is $n \times n$ and the size of the activation block is $k \times k$. The
number of measurements is m and $\mu/\sigma$ represents SNR per element of the activated block.

             Detection                                 Localization
Passive      $\mu/\sigma \asymp n/(\sqrt{m}\,k^2)$     $\mu/\sigma \asymp n/\sqrt{mk}$
             (Theorems 1 and 2)                        (Theorems 3 and 4)
Adaptive     $\mu/\sigma \asymp n/(\sqrt{m}\,k^2)$     $\mu/\sigma \asymp \max(n/(\sqrt{m}\,k^2),\, 1/\sqrt{mk})$
             (Theorems 1 and 2)                        (Theorems 5 and 6)

We also demonstrate minimax optimal upper bounds through detectors and 
estimators that can guarantee consistent detection and recovery of weak 
block-structured activations using few non-adaptive and adaptive compres- 
sive measurements. 

Our results indicate that adaptivity and structure play a key role and 
provide significant improvements over non-adaptive and unstructured cases 
for recovery of the activated block in the data matrix setting. This is unlike 
the vector case where contiguous structure and adaptivity have been shown 
to provide minor, if any, improvement (Candes and Davenport, 2011). 

In our setting we take compressive measurements of a data matrix of size
$(n_1 \times n_2)$, the activated block is of size $(k_1 \times k_2)$, with minimum SNR per
entry of $\mu/\sigma$, and we have a budget of m compressive measurements with
each measurement matrix constrained to have unit Frobenius norm.

Table 1 describes our main findings (assuming $n_1 = n_2 = n$ and $k_1 = k_2 = k$
and paraphrasing for clarity) and compares the scalings under which
passive and active detection and localization are possible.

For detection, akin to the vector setting, structure and adaptivity play
no role. The structured data matrix setting requires an SNR scaling as
$\sqrt{n_1 n_2/(m k_1^2 k_2^2)}$ for both non-adaptive and adaptive cases, which is the same
as the SNR needed to detect a $k_1 k_2$-sparse non-negative vector of length
$n_1 n_2$ as demonstrated in Arias-Castro (2012). Thus, the structure of the
activation pattern as well as the power of adaptivity offer no advantage in
the detection problem.

For localization of the activated block, the structured data matrix setting
requires an SNR scaling as $\sqrt{n_1 n_2/(m \min(k_1, k_2))}$ using non-adaptive
compressive measurements. In contrast, the unstructured setting requires a
higher SNR of $\sqrt{n_1 n_2 \log(n_1 n_2)/m}$ where $m \geq k_1 k_2 \log(n_1 n_2)$, as
demonstrated in Wainwright (2009b). Structure, even without adaptivity, already
yields a factor of $\sqrt{k}$ reduction in the smallest SNR that still allows for reliable
localization. Moreover, adaptivity in the compressive measurement design
yields further improvements. With adaptive measurements, identifying the
activated block requires a much weaker SNR of
$\max(\sqrt{n_1 n_2/(m k_1^2 k_2^2)}, \sqrt{1/(m \min(k_1, k_2))})$
for the weakest entry in the data matrix. For the sparse vector case,
Arias-Castro et al. (2011) showed that adaptive compressive measurements
cannot recover the non-zero locations if the SNR is smaller than
$\sqrt{n_1 n_2/m}$. A matching upper bound was provided using compressive binary
search in Davenport and Arias-Castro (2012) and Malloy and Nowak (2012) for
recovering the location of a single non-zero entry in the vector. Thus,
exploiting structure of the activations and designing adaptive linear
measurements can both yield significant gains
if the activation corresponds to a block in a data matrix.
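
As a rough numerical illustration of the scalings quoted above, the following sketch evaluates them for one choice of problem sizes. It is not part of the original analysis: the function name `snr_thresholds` and the example sizes are hypothetical, and constants and most logarithmic factors are dropped, so only the relative ordering of the values is meaningful.

```python
import math


def snr_thresholds(n1, n2, k1, k2, m):
    """Evaluate the SNR (mu/sigma) scalings from the text, without constants."""
    detection = math.sqrt(n1 * n2 / m) / (k1 * k2)
    passive_localization = math.sqrt(n1 * n2 / (m * min(k1, k2)))
    # Adaptive localization: the larger of the detection-type term and the
    # term that does not depend on the matrix size.
    adaptive_localization = max(detection, math.sqrt(1.0 / (m * min(k1, k2))))
    # Unstructured (spread-out) sparsity with passive measurements.
    unstructured_passive = math.sqrt(n1 * n2 * math.log(n1 * n2) / m)
    return {
        "detection": detection,
        "adaptive_localization": adaptive_localization,
        "passive_localization": passive_localization,
        "unstructured_passive_localization": unstructured_passive,
    }


# Hypothetical example sizes, chosen only for illustration.
rates = snr_thresholds(n1=1000, n2=1000, k1=10, k2=10, m=500)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

For small blocks the adaptive localization value coincides with the detection value, which mirrors the claim that adaptive localization can be as easy as detection in that regime.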

Related Work. Our work builds on a number of fairly recent contribu- 
tions on detection and recovery of a sparse and weak unstructured signal 
by adaptive compressive measurements. In Arias-Castro et al. (2011), the 
authors show that, in the linear regression setting, the adaptive compres- 
sive scheme offers improvements over the passive scheme which, in terms of 
MSE, are limited to a log(n) factor. The authors also provide a general proof
strategy for minimax analysis under adaptive measurements. Arias-Castro 
(2012) further applies this strategy to the problem of detection of an un- 
structured sparse and weak vector signal under compressive adaptive mea- 
surements. Malloy and Nowak (2012) shows that a compressive version of 
standard binary search achieves minimax performance for localization in a 
one-sparse vector. The work of Wainwright (2009b) which is based on an- 
alyzing the performance of an exhaustive search procedure under passive 
measurements, is relevant to our analysis of passive localization. Our analy- 
sis provides a generalization of these results to the case of a structured and 
weak signal embedded as a small contiguous block in a large matrix. 

While we focus on detection and localization of the activation in this 
paper, some other papers have considered estimation of sparse vectors in the 
mean square error (MSE) sense using adaptive compressive measurements. 
For example, Candes and Davenport (2011) establishes fundamental lower 
bounds on the MSE in a linear regression framework, while Haupt et al. 
(2009) demonstrates upper bounds using compressive distilled sensing. Some 
other papers (Baraniuk et al., 2010, Soni and Haupt, 2011) have considered 
different forms of structured sparsity in the vector setting, e.g. if the non- 
zero locations in a data vector form non-overlapping or partially-overlapping 
groups or are tree-structured. Finally, Negahban and Wainwright (2011) and 
Koltchinskii et al. (2011) have considered a measurement model identical to 
ours in the setting of low-rank matrix completion, but in that setting the 
matrix under consideration is not assumed to be a structured sparse matrix 
and the theoretical guarantees are with respect to the Frobenius norm. 

The rest of this paper is organized as follows. We describe the problem 
setup and notation in Section 2. We study the detection problem in Section
3, for both adaptive and non-adaptive schemes. Section 4 is devoted to the 
non-adaptive localization, while Section 5 is focused on adaptive localization. 
Finally, in Section 6 we present and discuss some simulations that support 
our findings. The proofs are given in the Appendix. 

2. Preliminaries. Let $A \in \mathbb{R}^{n_1 \times n_2}$ be a signal matrix with unknown
entries that we would like to recover. We are interested in a highly structured
setting where a contiguous block of the matrix A of size $(k_1 \times k_2)$ has entries
all equal to $\mu > 0$, while all the other elements of A are equal to zero. Define
the set of contiguous blocks,

(2.1)   $\mathcal{B} = \{ I_r \times I_c : I_r$ and $I_c$ are contiguous subsets of $[n_1]$ and $[n_2]$¹, $|I_r| = k_1$, $|I_c| = k_2 \}$.

Then $A = (a_{ij})$ with $a_{ij} = \mu \mathbb{1}\{(i,j) \in B^*\}$ for some (unknown) $B^* \in \mathcal{B}$. All
of our results extend to the case when the activation is not constant on $B^*$,
with $\min_{(i,j) \in B^*} a_{ij}$ replacing $\mu$ in all our results.

We consider the following observation model under which m noisy linear
measurements of A are available

(2.2)   $y_i = \mathrm{tr}(A X_i) + \epsilon_i, \quad i = 1, \ldots, m,$

where $\epsilon_1, \ldots, \epsilon_m \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)$, $\sigma > 0$ known, and the sensing matrices $(X_i)_i$
satisfy either $\|X_i\|_F^2 \leq 1$ or $\mathbb{E}\|X_i\|_F^2 = 1$.

Under the observation model in Eq. (2.2), we study two tasks: (1) de- 
tecting whether a contiguous block of positive signal exists in A and (2) 
identifying the block B* , that is, the localization of B* . We develop efficient 
algorithms for these two tasks that provably require the smallest number of 
measurements, as explained below. The algorithms are designed for one of 
two measurement schemes: (1) the measurement scheme can be implemented 
in an adaptive or sequential fashion, that is, actively, by letting each $X_i$ be
a (possibly randomized) function of $(y_j, X_j)_{j \in [i-1]}$, and (2) the measurement
matrices are chosen all at once, that is, passively. 
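
A minimal simulation sketch of the observation model in Eq. (2.2) under the passive scheme may help fix ideas. It assumes that tr(A X_i) is read as the entrywise inner product of A and X_i, and that the passive sensing matrices have i.i.d. Gaussian entries with variance 1/(n_1 n_2), as in Section 4. All helper names are illustrative and do not come from the paper.

```python
import math
import random


def block_matrix(n1, n2, k1, k2, row0, col0, mu):
    """Return an n1 x n2 matrix equal to mu on a contiguous k1 x k2 block, 0 elsewhere."""
    A = [[0.0] * n2 for _ in range(n1)]
    for a in range(row0, row0 + k1):
        for b in range(col0, col0 + k2):
            A[a][b] = mu
    return A


def inner(A, X):
    """Entrywise inner product sum_{a,b} A[a][b] * X[a][b] (assumed reading of tr(A X))."""
    return sum(x * y for row_a, row_x in zip(A, X) for x, y in zip(row_a, row_x))


def passive_sensing_matrix(n1, n2, rng):
    """Draw i.i.d. N(0, 1/(n1*n2)) entries, so the expected squared Frobenius norm is 1."""
    std = 1.0 / math.sqrt(n1 * n2)
    return [[rng.gauss(0.0, std) for _ in range(n2)] for _ in range(n1)]


def measure(A, X, sigma, rng):
    """One noisy linear measurement y = <A, X> + eps with eps ~ N(0, sigma^2)."""
    return inner(A, X) + rng.gauss(0.0, sigma)


rng = random.Random(0)
n1, n2, k1, k2, mu, sigma, m = 8, 8, 2, 3, 1.5, 1.0, 5
A = block_matrix(n1, n2, k1, k2, row0=1, col0=4, mu=mu)
Xs = [passive_sensing_matrix(n1, n2, rng) for _ in range(m)]
ys = [measure(A, X, sigma, rng) for X in Xs]
print(ys)
```

An adaptive scheme would differ only in how each sensing matrix is produced: it would be computed from the earlier pairs of measurements and sensing matrices rather than drawn in advance.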

Detection. The detection problem concerns checking whether a positive 
contiguous block exists in A. As we will show later, we can detect presence 
of a contiguous block with much smaller number of measurements than is 
required for localizing its position. Therefore, solving the detection problem 
before trying to localize the block is often important. Formally, detection is
a hypothesis testing problem with a composite alternative of the form

(2.3)   $H_0 : A = 0_{n_1 \times n_2}$
        $H_1 : A = (a_{ij})$ with $a_{ij} = \mu \mathbb{1}\{(i,j) \in B\}$, $B \in \mathcal{B}$.

¹ We use [n] to denote the set {1, . . . , n}.

A test T is a measurable function of the observations and the measurement
matrices $(y_i, X_i)_{i \in [m]}$, which takes values in {0, 1}: T = 1 if the null
hypothesis is rejected and T = 0 otherwise. For any test T, we define its risk
as

$R(T) = P_0[T((y_i, X_i)_{i \in [m]}) = 1] + \max_{B \in \mathcal{B}} P_B[T((y_i, X_i)_{i \in [m]}) = 0]$,

where $P_0$ and $P_B$ denote the joint probability distributions of $((y_i, X_i)_{i \in [m]})$
under the null hypothesis and when the activation pattern is B, respectively.
The risk R(T) measures the maximal sum of type I and type II errors over
the set of alternatives. The overall difficulty of the detection problem is
quantified by the minimax risk $R = \inf_T R(T)$, where the infimum is taken
over all tests. For a sufficiently small SNR, the minimax risk is bounded away
from zero by a large constant, which implies that no test can distinguish $H_0$
from $H_1$. We precisely characterize the boundary for SNR below which no
test can distinguish $H_0$ and $H_1$.

Localization. The localization problem concerns recovery of the true
activation pattern $B^*$. Let $\Psi$ be an estimator of $B^*$, with the risk,
corresponding to a 0/1 loss, given by

$R(\Psi) = \max_{B \in \mathcal{B}} P_B[\Psi((y_i, X_i)_{i \in [m]}) \neq B]$,

while the minimax risk of the localization problem is the minimal risk over
all such estimators $\Psi$. As in the detection task, the minimax risk specifies
the minimal risk of any localization procedure. By standard arguments, the
evaluation of the minimax localization risk also proceeds by first reducing the
localization problem to a hypothesis testing problem (see, e.g., Tsybakov,
2009, for details).

Below we will provide a sharp characterization, through information-theoretic
lower bounds and tractable estimators, of the minimax detection and
localization risks as functions of tuples of $(n_1, n_2, k_1, k_2, m, \mu, \sigma)$ and for
both the active and passive sampling schemes. Our results identify precisely 
both the minimal SNR given a budget of m possibly adaptive measurements, 
and the minimal number of measurements m for a given SNR in order to 
achieve successful detection and localization. 

Along with a careful and detailed minimax analysis, we also describe 
procedures for detection and localization in both the active and passive case 
whose risks match the minimax rates. 

3. Detection of contiguous blocks. In this section, we provide a
sharp characterization of the minimax detection risk. 

3.1. Lower bound. The following theorem gives a lower bound on the
SNR needed to distinguish $H_0$ and $H_1$.

Theorem 1. Fix any $0 < \alpha < 1$. Based on m (possibly adaptive)
measurements, if $\mu < \mu_{\min}$, where

$\mu_{\min} = \frac{4\sigma(1 - \alpha)}{k_1 k_2}\sqrt{\frac{(n_1 - k_1)(n_2 - k_2)}{m}}$,

then any test to distinguish $H_0$ from $H_1$, defined in Eq. (2.3), has risk at
least $\alpha$.

The result of Theorem 1 can be interpreted as follows: whatever the test
T and the risk level $\alpha$ are, there exists $A = (a_{ij})$ with $a_{ij} = \mu \mathbb{1}\{(i,j) \in B^*\}$,
$\mu < \mu_{\min}$, such that $R(T) \geq \alpha$. This gives a lower bound on the minimax
risk as $\inf_T R(T) \geq \alpha$.

The lower bound on possibly adaptive procedures is established by ana- 
lyzing the risk of the (optimal) likelihood ratio test under a certain prior on 
the alternatives. Careful modifications of standard arguments are necessary 
to account for adaptivity. We closely follow the approach of
Arias-Castro (2012), who established the analogue of Theorem 1 in the
vector setting.

3.2. Upper bound. We now discuss the sharpness of the result established 
in the previous section. We choose the sensing matrices passively as
$X_i = (n_1 n_2)^{-1/2} \mathbf{1}_{n_1} \mathbf{1}_{n_2}'$ and consider the following test



test defined in Eq. (3.1). 

Results of Theorem 1 and Theorem 2 establish that the minimax rate
for detection under the model in Eq. (2.2) is $\mu \asymp \sigma (k_1 k_2)^{-1} \sqrt{m^{-1} n_1 n_2}$,
under the (mild) assumption that $k_1 \leq c n_1$ and $k_2 \leq c n_2$ for any constant
$0 < c < 1$. It is worth pointing out that the structure of the activation
pattern does not play any role in the minimax detection problem. We will
contrast this to the localization problem below. Furthermore, the procedure
that achieves the adaptive lower bound (up to constants) is non-adaptive,
indicating that adaptivity cannot help much in the detection problem.
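
The passive detection procedure can be sketched in code as follows. This is only an illustration under stated assumptions: the statistic t is the normalized sum of measurements used in the proof of Theorem 2, while the rejection threshold sigma * sqrt(2 log(1/alpha)) is an assumed choice based on a standard Gaussian tail bound and is not taken from the paper. Function names are hypothetical.

```python
import math
import random


def flat_measurements(n1, n2, k1, k2, mu, sigma, m, rng):
    """Simulate y_i for X_i = (n1*n2)^(-1/2) * all-ones; mu = 0 gives the null hypothesis.

    With this sensing matrix the noiseless measurement is k1*k2*mu/sqrt(n1*n2),
    regardless of where the block is located.
    """
    signal = k1 * k2 * mu / math.sqrt(n1 * n2)
    return [signal + rng.gauss(0.0, sigma) for _ in range(m)]


def detection_statistic(ys):
    """t = m^(-1/2) * sum_i y_i, which has variance sigma^2 under both hypotheses."""
    return sum(ys) / math.sqrt(len(ys))


def detect(ys, sigma, alpha):
    """Return 1 (reject the null) when t exceeds the assumed threshold.

    Since P(Z > z) <= exp(-z^2 / 2) for a standard normal Z and z >= 0,
    this threshold keeps the type I error at most alpha.
    """
    threshold = sigma * math.sqrt(2.0 * math.log(1.0 / alpha))
    return 1 if detection_statistic(ys) > threshold else 0


rng = random.Random(1)
null_ys = flat_measurements(64, 64, 4, 4, mu=0.0, sigma=1.0, m=100, rng=rng)
alt_ys = flat_measurements(64, 64, 4, 4, mu=4.0, sigma=1.0, m=100, rng=rng)
print(detect(null_ys, sigma=1.0, alpha=0.05), detect(alt_ys, sigma=1.0, alpha=0.05))
```

Because the sensing matrix is constant, the statistic depends on the block only through its size, which is one way to see why the block structure plays no role in detection.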

4. Localization from passive measurements. In this section, we
address the problem of estimating a contiguous block of activation $B^*$ from
the noisy linear measurements in Eq. (2.2), when the measurement matrices
$(X_i)_{i \in [m]}$ are independent with i.i.d. entries $X_{i,ab} \sim \mathcal{N}(0, (n_1 n_2)^{-1})$. The
variance of the elements is set so that $\mathbb{E}\|X_i\|_F^2 = 1$.

4.1. Lower bound. The following theorem gives a lower bound on the 
SNR needed for any procedure to localize B*. 

Theorem 3. There exist two positive constants $C, C' > 0$ independent of
the problem parameters $(k_1, k_2, n_1, n_2)$, such that if $\mu < \mu_{\min}$, where

$\mu_{\min} := C\sigma\sqrt{\frac{n_1 n_2}{m}\max\left(\frac{1}{\min(k_1, k_2)}, \frac{\log\max(n_1 - k_1, n_2 - k_2)}{k_1 k_2}\right)}$,

then $\inf_\Psi R(\Psi) \geq C' > 0$ as $n \to \infty$.

The proof is based on a standard technique described in Chapter 2.6 of 
Tsybakov (2009). We start by identifying a subset of matrices from A which 
are hard to distinguish. Once a suitable finite set is identified, tools for 
establishing lower bounds on the error in multiple-hypothesis testing can be 
directly applied. These tools only require computing the Kullback-Leibler 
(KL) divergence between the induced distributions, which in our case are 
two multivariate normal distributions. 

The two terms in the lower bound reflect two aspects of our construction:
the first term arises from considering two matrices that overlap considerably,
while the second term arises from considering matrices that do not overlap
at all, of which there are possibly a very large number. These constructions
and calculations are described in detail in the Appendix. 

4.2. Upper bound. We will investigate a procedure that searches over all
contiguous blocks of size $(k_1 \times k_2)$ defined in Eq. (2.1) and outputs one that
minimizes the squared error. Define the loss function $f : \mathcal{B} \to \mathbb{R}$ as

(4.1)   $f(B) := \min_{\hat{\mu}} \sum_{i \in [m]} \left( \hat{\mu} \sum_{(a,b) \in B} X_{i,ab} - y_i \right)^2$.

Then the estimated block $\hat{B}$ is defined as

(4.2)   $\hat{B} := \arg\min_{B \in \mathcal{B}} f(B)$.

Note that the minimization problem above requires solving $O(n_1 n_2)$ univariate
regression problems and can be implemented efficiently for reasonably
large matrices.
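
The search in Eqs. (4.1) and (4.2) can be sketched as follows. This is a brute-force illustration, not an efficient implementation: it assumes the inner minimization over the scalar is unconstrained, so that each univariate regression has the closed-form loss sum(y^2) - (sum(s*y))^2 / sum(s^2), where s_i is the sum of the entries of X_i over the candidate block. The function names and the small problem sizes are hypothetical.

```python
import math
import random


def localize_block(Xs, ys, n1, n2, k1, k2):
    """Return the top-left corner (row, col) of the k1 x k2 block minimizing f(B)."""
    best_block, best_loss = None, math.inf
    yy = sum(y * y for y in ys)
    for r in range(n1 - k1 + 1):
        for c in range(n2 - k2 + 1):
            # s[i] is the sum of the entries of X_i over the candidate block.
            s = [sum(X[a][b] for a in range(r, r + k1) for b in range(c, c + k2))
                 for X in Xs]
            ss = sum(v * v for v in s)
            sy = sum(v * y for v, y in zip(s, ys))
            # Closed-form residual of regressing y on s with a single coefficient.
            loss = yy - sy * sy / ss if ss > 0.0 else yy
            if loss < best_loss:
                best_block, best_loss = (r, c), loss
    return best_block


def demo(sigma, seed, n=6, k=2, m=50, mu=1.0, true_block=(3, 1)):
    """Simulate passive Gaussian measurements of one block and run the search."""
    rng = random.Random(seed)
    std = 1.0 / n  # entries have variance 1/(n*n)
    Xs = [[[rng.gauss(0.0, std) for _ in range(n)] for _ in range(n)]
          for _ in range(m)]
    r0, c0 = true_block
    ys = [mu * sum(X[a][b] for a in range(r0, r0 + k) for b in range(c0, c0 + k))
          + rng.gauss(0.0, sigma) for X in Xs]
    return localize_block(Xs, ys, n, n, k, k)


print(demo(sigma=0.05, seed=0))
```

Recomputing each block sum from scratch costs O(k_1 k_2) per block and measurement; a cumulative-sum table over each X_i would remove that factor, which is one way to keep the search practical for larger matrices.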

The following result characterizes the SNR needed for $\hat{B}$ to correctly
identify $B^*$.

Theorem 4. There exists a positive constant $C > 0$ independent of the
problem parameters $(k_1, k_2, n_1, n_2)$, such that if

$\mu \geq C\sigma\sqrt{\frac{n_1 n_2}{m}\max\left(\frac{\log\max(k_1, k_2)}{\min(k_1, k_2)}, \frac{\log\max(n_1 - k_1, n_2 - k_2)}{k_1 k_2}\right)}$,

then $R(\hat{B}) \leq \alpha$, where $\hat{B}$ is defined in Eq. (4.2).

Comparing to the lower bound in Theorem 3, we observe that the procedure
outlined in this section achieves the lower bound up to constants and
a log k factor. Under the scaling $\min(k_1, k_2) \geq \log\max(n_1 - k_1, n_2 - k_2)$, we
obtain that the passive minimax rate for localization of the active block $B^*$
is $\mu \asymp \tilde{O}(\sigma\sqrt{(m \min(k_1, k_2))^{-1} n_1 n_2})$. This establishes that the SNR needed
for passive localization is considerably larger than the bound we saw earlier
for passive detection. This should be contrasted to the normal means
problem, where the bounds for localization and detection differ only in constants
(Donoho and Jin, 2004).

The block structure of the activation allows us, even in the passive setting,
to localize much weaker signals. A straightforward adaptation of results on
the LASSO (Wainwright, 2009a) suggests that if the non-zero entries are
spread out (say at random) then we would require
$\mu \asymp O(\sigma\sqrt{n_1 n_2 \log(n_1 n_2)/m})$ for localization.

5. Localization from active measurements. In this section, we study
localization of $B^*$ using adaptive procedures, that is, the measurement
matrix $X_i$ may be a function of $(y_j, X_j)_{j \in [i-1]}$.

5.1. Lower bound. A lower bound on the SNR needed for any active 
procedure to localize B* is given. 

Theorem 5. Fix any $0 < \alpha < 1$. Given m adaptively chosen
measurements, if $\mu < \mu_{\min}^{\mathrm{loc,active}}$, where



ioc,active /-, n / /2 max((n.i - k\){n 2 /2 - k 2 ), (n x /2 - k{){n 2 - k 2 )) 



then $\inf_\Psi R(\Psi) \geq \alpha$.



The proof is based on information-theoretic arguments applied to
specific pairs of hypotheses that are hard to distinguish. The two terms in the
lower bound reflect the two sources of hardness of the problem of exactly 
localizing the block of activation. The first term reflects the hardness of ap- 
proximately localizing the block of activation. This term grows at the same 
rate as the detection lower bound, and its proof is similar. Given a coarse 
localization of the block we still need to exactly localize the block. The hard- 
ness of this problem gives rise to the second term in the lower bound. The 
term is independent of $n_1$ and $n_2$ but has a considerably worse dependence
on $k_1$ and $k_2$.

5.2. Upper bound. The upper bound is established by analyzing the
procedures described in Algorithms 1 and 2 for approximate and exact
localization. Algorithm 1 is used to approximately locate the activation block,
that is, it locates a $2k_1 \times 2k_2$ block that contains the activation block with
high probability. The algorithm essentially performs the compressive binary
search on a collection of non-overlapping blocks that partition the signal
matrix. It is run on two collections, $\mathcal{D}_1$ and $\mathcal{D}_2$, defined as

$\mathcal{D}_1 = \{ B_{1,1} = [1, \ldots, 2k_1] \times [1, \ldots, 2k_2] \cup B_{1,2} = [2k_1 + 1, \ldots, 4k_1] \times [1, \ldots, 2k_2] \cup \ldots$
$\ldots \cup B_{1, n_1 n_2/(4 k_1 k_2)} = [n_1 - 2k_1, \ldots, n_1] \times [n_2 - 2k_2, \ldots, n_2] \}$

and

$\mathcal{D}_2 = \{ B_{2,1} = [k_1, \ldots, 3k_1] \times [k_2, \ldots, 3k_2] \cup B_{2,2} = [3k_1 + 1, \ldots, 5k_1] \times [k_2, \ldots, 3k_2] \cup \ldots$
$\ldots \cup B_{2, n_1 n_2/(4 k_1 k_2)} = [n_1 - k_1, \ldots, n_1, 1, \ldots, k_1] \times [n_2 - k_2, \ldots, n_2, 1, \ldots, k_2] \}$.

Notice that one of these collections must contain a block with the full block
of activation. Algorithm 1 applied twice returns two blocks, one of which, as
we show, has the desired block with high probability.

Algorithm 2 is used next to precisely locate the activation block within
one of the two coarser blocks identified by Algorithm 1. Algorithm 2 is a
modified compressive binary search procedure that is used to quickly zoom
in on the active rows and columns within a larger block.

Algorithm 1 Approximate localization

input Measurement budget $m \geq \log p$, (dyadic) ordered collection $\mathcal{D}$ of size p of blocks
of size $(u_1 \times u_2)$

Initial support: $J_0^{(1)} = \{1, \ldots, p\}$, $s_0 = \log_2 p$
For each s in $1, \ldots, \log_2 p$

1. Allocate: $m_s = \lfloor (m - s_0)\, s\, 2^{-s-1} \rfloor + 1$

2. Split: $J_1^{(s)}$ and $J_2^{(s)}$, left and right half collections of blocks of $J_0^{(s)}$

3. Sensing matrix: $X_s = \sqrt{2^{-(s_0 - s + 1)}}$ on $J_1^{(s)}$, $X_s = -\sqrt{2^{-(s_0 - s + 1)}}$ on $J_2^{(s)}$ and 0
otherwise.

4. Measure: $y_i^{(s)} = \mathrm{tr}(A X_s) + z_i^{(s)}$ for $i \in [1, \ldots, m_s]$

5. Update support: $J_0^{(s+1)} = J_1^{(s)}$ if $\sum_{i=1}^{m_s} y_i^{(s)} > 0$ and $J_0^{(s+1)} = J_2^{(s)}$ otherwise
output The single block in $J_0^{(s_0 + 1)}$.

The following theorem states that Algorithm 1 and Algorithm 2 succeed in 
localization of the active block with high probability if SNR is large enough. 

Theorem 6. If

$\mu \geq \sigma\sqrt{\log(1/\alpha)}\;\tilde{O}\left(\max\left(\sqrt{\frac{n_1 n_2}{m k_1^2 k_2^2}}, \sqrt{\frac{1}{\min(k_1, k_2)\, m}}\right)\right)$

and $m \geq 3\log(n_1 n_2)$ then $\inf_\Psi R(\Psi) \leq \alpha$.

The $\tilde{O}$ hides a $\sqrt{\log\max(k_1, k_2)}$ factor, and our upper bound matches the
lower bound up to this factor. It is worth noting that for small activation
blocks (when the first term dominates) our active localization procedure
achieves the detection limits. This is the best result we could hope for. For
larger activation blocks, the lower bound indicates that no procedure can
achieve the detection rate. The active procedure still remains significantly
more efficient than the passive one, and even in this case is able to detect
signals that are weaker by a (large) $\sqrt{n_1 n_2}$ factor. This is not the case
for compressed sensing with vectors as shown in Arias-Castro et al. (2011).
The great potential for gains from adaptive measurements is clearly seen in
our model which captures the fundamental interplay between structure and
adaptivity.
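
The approximate localization step can be illustrated with a small sketch of compressive binary search over a collection of p blocks. Several simplifications are assumptions of this sketch rather than statements from the paper: the signal is summarized by the total mass of each block (`block_mass`), p is a power of two, and the sensing weight is normalized by the number of entries in the current support so that each sensing matrix has unit Frobenius norm. The budget split follows the allocation rule of Algorithm 1.

```python
import math
import random


def compressive_binary_search(block_mass, entries_per_block, m, sigma, rng):
    """Return the index of the block selected after log2(p) halving steps.

    block_mass[j] is the sum of the entries of A inside block j (assumed summary).
    """
    p = len(block_mass)
    s0 = int(math.log2(p))
    assert 2 ** s0 == p, "this sketch assumes p is a power of two"
    assert m >= s0, "the budget must allow at least one measurement per step"
    support = list(range(p))
    for s in range(1, s0 + 1):
        # Later steps receive a share of the budget proportional to s * 2^(-s-1).
        m_s = int((m - s0) * s * 2.0 ** (-s - 1)) + 1
        half = len(support) // 2
        left, right = support[:half], support[half:]
        # Weight +w on the left half, -w on the right half, 0 elsewhere.
        w = 1.0 / math.sqrt(len(support) * entries_per_block)
        signal = w * (sum(block_mass[j] for j in left)
                      - sum(block_mass[j] for j in right))
        total = sum(signal + rng.gauss(0.0, sigma) for _ in range(m_s))
        support = left if total > 0 else right
    return support[0]


rng = random.Random(0)
mass = [0.0] * 16
mass[11] = 4 * 4 * 2.0  # hypothetical: a 4 x 4 activation with mu = 2 inside block 11
print(compressive_binary_search(mass, entries_per_block=64, m=200, sigma=1.0, rng=rng))
```

The support halves at every step, so the weight on each remaining entry grows as the search proceeds, and the per-measurement signal becomes stronger exactly where the later decisions are made.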

Algorithm 2 Exact localization

input Measurement budget 5m, a sub-matrix $B \in \mathbb{R}^{4k_1 \times 4k_2}$

Measure: $y_i^c = (4k_1)^{-1/2} \sum_{j=1}^{4k_1} B_{jc} + z_i^c$ for $i \in \{1, \ldots, m\}$ and $c \in \{1, k_2 + 1, 2k_2 + 1, 3k_2 + 1\}$

Let $l = \arg\max_c \sum_{i=1}^m y_i^c$
Let $r = l + k_2$
Let m 6 = I m r I 
While $r - l > 1$

1. Let $c = \lfloor (r + l)/2 \rfloor$

2. Measure $y_i^c = (4k_1)^{-1/2} \sum_{j=1}^{4k_1} B_{jc} + z_i^c$ for $i \in \{1, \ldots, m_b\}$
3- If 2~Z2i 2/i — r i then Z = c, otherwise r = c 

output Set of columns $\{l - k_2 + 1, \ldots, l\}$



6. Experiments. In this section, we perform a set of simulation studies 
to illustrate finite sample performance of the proposed procedures. We let 
$n_1 = n_2 = n$ and $k_1 = k_2 = k$. Theorem 4 and Theorem 6 characterize
the SNR needed for the passive and active identification of a contiguous 
block, respectively. We demonstrate that the scalings predicted by these 
theorems are sharp by plotting the probability of successful recovery against 
appropriately rescaled SNR and showing that the curves for different values 
of n and k line up. 

Experiment 1. Figure 1 shows the probability of successful localization 
of $B^*$ using $\hat{B}$ defined in Eq. (4.2) plotted against $n^{-1}\sqrt{mk}\cdot$SNR, where the
number of measurements m = 100. Each plot in Figure 1 represents a different
relationship between k and n; in the first plot, $k = O(\log n)$, in the second
$k = O(\sqrt{n})$, while in the third plot $k = O(n)$. The dashed vertical line
denotes the threshold position for the scaled SNR at which the probability 
of success is larger than 0.95. We observe that irrespective of the problem 
size and the relationship between n and k, Theorem 4 tightly characterizes 
the minimum SNR needed for successful identification. 

Experiment 2. Figure 2 shows the probability of successful localization
of $B^*$ using the procedure outlined in Section 5.2, with m = 500 adaptively
chosen measurements, plotted against the scaled SNR. The SNR is scaled
by $n^{-1}\sqrt{m}\,k^2$ in the first two plots where $k = O(\log n)$ and $k = O(\sqrt{n})$
respectively, while in the third plot the SNR is scaled by $\sqrt{mk/\log k}$ as
$k = O(n)$. The dashed vertical line denotes the threshold position for the scaled
SNR at which the probability of success is larger than 0.95. We observe that
Theorem 6 sharply characterizes the minimum SNR needed for successful
identification.

Fig 1. Probability of success with passive measurements (averaged over 100
simulation runs).

Fig 2. Probability of success with adaptively chosen measurements (averaged over
100 simulation runs).



7. Discussion. In this paper, we establish the fundamental limits for 
the problem of detecting and localizing a block of weak activation in a data 
matrix from either adaptive or non-adaptive compressive measurements. Our
bounds precisely characterize the tradeoff between signal-to-noise ratio, size 
of matrix, size of sub-matrix and number of measurements. We also demon- 
strate constructive computationally efficient procedures that achieve these 
bounds. Contrary to recent results for sparse vectors which demonstrate 
that contiguous structure for the activation and the ability to choose mea- 
surements adaptively play a negligible role in detection and localization, our 
results indicate that both the block-structure of the activation and adaptive 
measurement design significantly improve the localization performance for 
data matrices. An intuitive explanation for why adaptive sampling helps in 
the structured case is that in this case it is possible to quickly focus the 
sampling using a compressive binary search procedure, and then exploit the 
structure for exact localization. In the unstructured case however the sig- 
nal can be spread out and the adaptive procedure has no way to rule out 
candidate locations quickly and has to repeatedly measure essentially all
locations. 

In this paper, we assumed that an ordering of rows and columns of 
the data matrix is available. Such an ordering may be obtained by a pre- 
processing step that clusters the rows and columns of the matrix. However, 
the general problem of recovering an activated block within a randomly 
permuted data matrix, commonly known as biclustering, is also important. 
The biclustering problem can be harder than the un-permuted setting as 
established in Kolar et al. (2011), at least when all the matrix entries can
be directly observed, and we hope to address its compressive analog in future
work. 

One important open question, that remains unsolved, is the problem of 
finding the size of the activation block in a data dependent way. At the mo- 
ment we are not aware of procedures that can localize the activation block 
at the minimax SNR without the knowledge of its size. Butucea and Ingster 
(2011) propose test procedures, under a slightly different model, for detec- 
tion of the activation block that do not require the knowledge of the size, 
but work with a collection of sizes that contain the true size. However, the 
price for being agnostic to the size is reflected in the established rates, which 
reflect the difficulty of detecting the hardest activation block in the collec- 
tion. Therefore, even for the problem of detection, adaptation to the size is 
an open problem. 

APPENDIX A: PROOFS OF MAIN RESULTS 

In this appendix, we collect proofs of the results stated in the paper.
Throughout the proofs, we will denote by $c_1, c_2, \ldots$ positive constants that may
change their value from line to line.

A.1. Proof of Theorem 1. We lower bound the Bayes risk of any test
T. Recall the null and alternative hypotheses, defined in Eq. (2.3),

$H_0 : A = 0_{n_1 \times n_2}$
$H_1 : A = (a_{ij})$ with $a_{ij} = \mu \mathbb{1}\{(i,j) \in B\}$, $B \in \mathcal{B}$.

We will consider a uniform prior $\pi$ over the alternatives, and bound the
average risk

$R_\pi(T) = P_0[T = 1] + \mathbb{E}_{A \sim \pi} P_A[T = 0]$,

which provides a lower bound on the worst case risk of T.

Under the prior $\pi$, the hypothesis testing problem becomes to distinguish

$H_0 : A = 0_{n_1 \times n_2}$
$H_1 : A = (a_{ij})$ with $a_{ij} = \mathbb{E}_{B \sim \pi}\, \mu \mathbb{1}\{(i,j) \in B\}$.

Both $H_0$ and $H_1$ are simple and the likelihood ratio test is optimal by the
Neyman-Pearson lemma. The likelihood ratio is

$L \equiv \frac{\mathbb{E}_\pi P_A[(y_i, X_i)_{i \in [m]}]}{P_0[(y_i, X_i)_{i \in [m]}]} = \mathbb{E}_\pi \prod_{i=1}^m \frac{P_A[y_i | X_i]}{P_0[y_i | X_i]}$,

where the second equality follows by decomposing the probabilities by the
chain rule and observing that $P_0[X_i | (y_j, X_j)_{j \in [i-1]}] = P_A[X_i | (y_j, X_j)_{j \in [i-1]}]$,
since the sampling strategy (whether active or passive) is the same
irrespective of the true hypothesis.

The likelihood ratio can be further simplified as

$L = \mathbb{E}_\pi \exp\left( \sum_{i=1}^m \frac{2 y_i \mathrm{tr}(A X_i) - \mathrm{tr}(A X_i)^2}{2\sigma^2} \right)$.

The average risk of the likelihood ratio test 

R n (T) = l-^\\E n P A -F \\ TV 

is determined by the total variation distance between the mixture of alter- 
natives from the null. 

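By Pinsker's inequality (Tsybakov, 2009), 

$$\| \mathbb{E}_\pi \mathbb{P}_A - \mathbb{P}_0 \|_{TV} \le \sqrt{KL(\mathbb{P}_0, \mathbb{E}_\pi \mathbb{P}_A)/2}$$

and 

$$\begin{aligned}
KL(\mathbb{P}_0, \mathbb{E}_\pi \mathbb{P}_A) &= -\mathbb{E}_0 \log L \\
&\le -\mathbb{E}_\pi \mathbb{E}_0 \sum_{i=1}^m \frac{2 y_i \operatorname{tr}(A X_i) - \operatorname{tr}(A X_i)^2}{2\sigma^2} \\
&= \mathbb{E}_\pi \mathbb{E}_0 \sum_{i=1}^m \frac{\operatorname{tr}(A X_i)^2}{2\sigma^2} \\
&\le \frac{m}{2\sigma^2} \|C\|_{op},
\end{aligned}$$

where the first inequality follows by applying Jensen's inequality followed 
by Fubini's theorem, the equality holds because $y_i$ has mean zero given 
$X_i$ under $\mathbb{P}_0$, and the second inequality follows using the fact that 
$\|X_i\|_F^2 = 1$, where $C \in \mathbb{R}^{n_1 n_2 \times n_1 n_2}$. 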

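To describe the entries of $C$, consider the invertible map $\tau$ from a linear 
index in $\{1, \ldots, n_1 n_2\}$ to an entry of $A$. Now, 
$C_{ii} = \mu^2\, \mathbb{P}_\pi[A_{\tau(i)} \neq 0]$ and 

$$C_{ij} = \mu^2\, \mathbb{P}_\pi[A_{\tau(i)} \neq 0,\ A_{\tau(j)} \neq 0],$$

that is, $\mu^2$ times the prior probability that the entries $\tau(i)$ and 
$\tau(j)$ are both active. 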






To bound the operator norm of $C$ we make two observations. First, 
because of the contiguous structure of the activation pattern, in any row of 
$C$ there are at most $k_1 k_2$ non-zero entries. Second, each non-zero entry in 
$C$ is of magnitude at most $\mu^2 k_1 k_2 / ((n_1 - k_1)(n_2 - k_2))$. 

Combining the two observations, 

$$\|C\|_{op} \le \max_j \sum_k |C_{jk}| \le \frac{\mu^2 k_1^2 k_2^2}{(n_1 - k_1)(n_2 - k_2)},$$

from which we obtain a bound on the KL divergence. 
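
The row-sum bound on the operator norm can be checked numerically on a toy instance. The sketch below is only an illustration under our own assumptions: it uses a uniform prior over all $(n_1-k_1+1)(n_2-k_2+1)$ block positions, zero-based indices, and the hypothetical helper names `second_moment_matrix`, `max_row_sum` and `top_eigenvalue`, none of which appear in the paper.

```python
from itertools import product

def second_moment_matrix(n1, n2, k1, k2, mu=1.0):
    """C[i][j] = mu^2 * P[entries i and j are both active], uniform prior over blocks."""
    cells = list(product(range(n1), range(n2)))
    index = {cell: i for i, cell in enumerate(cells)}
    corners = list(product(range(n1 - k1 + 1), range(n2 - k2 + 1)))
    C = [[0.0] * len(cells) for _ in cells]
    for r, c in corners:
        block = [index[(r + a, c + b)] for a in range(k1) for b in range(k2)]
        for i in block:
            for j in block:
                C[i][j] += mu ** 2 / len(corners)
    return C

def max_row_sum(C):
    """Maximum absolute row sum, an upper bound on the operator norm of a symmetric matrix."""
    return max(sum(abs(v) for v in row) for row in C)

def top_eigenvalue(C, iters=200):
    """Rayleigh quotient after power iteration; never exceeds the largest eigenvalue."""
    n = len(C)
    v = [1.0] * n
    for _ in range(iters):
        w = [sum(C[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return sum(v[i] * sum(C[i][j] * v[j] for j in range(n)) for i in range(n))
```

For $n_1 = n_2 = 4$ and $k_1 = k_2 = 2$ with $\mu = 1$, the largest row sum is $16/9$, which sits below the bound $\mu^2 k_1^2 k_2^2 / ((n_1-k_1)(n_2-k_2)) = 4$ used in the proof.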
This gives us that 

$$R_\pi(T) \ge 1 - \frac{k_1 k_2 \mu}{\sigma} \sqrt{\frac{m}{16 (n_1 - k_1)(n_2 - k_2)}},$$

proving the lower bound on the minimax risk. 
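
To get a feel for the scaling, the bound $R \ge 1 - (k_1 k_2 \mu/\sigma)\sqrt{m/(16(n_1-k_1)(n_2-k_2))}$ can be evaluated directly. The following is a minimal sketch; the function names are ours, and the explicit $\sigma$ in the formula is our reading of the derivation above rather than something quoted from the theorem statement.

```python
import math

def detection_risk_lower_bound(mu, sigma, m, n1, n2, k1, k2):
    """Lower bound on the Bayes (hence minimax) detection risk, clipped at zero."""
    gap = (k1 * k2 * mu / sigma) * math.sqrt(m / (16.0 * (n1 - k1) * (n2 - k2)))
    return max(0.0, 1.0 - gap)

def largest_mu_with_risk_at_least(alpha, sigma, m, n1, n2, k1, k2):
    """Signal strength at which the lower bound equals alpha (0 < alpha < 1)."""
    return sigma * (1.0 - alpha) * math.sqrt(
        16.0 * (n1 - k1) * (n2 - k2) / (m * k1 ** 2 * k2 ** 2)
    )
```

Any $\mu$ below the value returned by `largest_mu_with_risk_at_least` keeps the risk of every test above `alpha`, regardless of how the $m$ measurements are chosen.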

A.2. Proof of Theorem 2. Define $t = m^{-1/2} \sum_{i=1}^m y_i$. It is easy to see 
that under $H_0$, $t \sim N(0, \sigma^2)$, while under $H_1$, 
$t \sim N\big(\sqrt{m/(n_1 n_2)}\, k_1 k_2 \mu,\ \sigma^2\big)$. The 
theorem now follows from an application of standard Gaussian tail bounds. 
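
The two Gaussian laws of $t$ make the error probabilities of a threshold test explicit. The sketch below assumes, purely for illustration, that the test rejects when $t$ exceeds half of the mean under $H_1$; the threshold choice and the function names are ours and are not taken from the paper.

```python
import math

def gaussian_upper_tail(t):
    """P[N(0,1) > t], computed from the complementary error function."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def detection_test_errors(mu, sigma, m, n1, n2, k1, k2):
    """Threshold, type I error and type II error of the midpoint threshold test."""
    mean_alt = math.sqrt(m / (n1 * n2)) * k1 * k2 * mu   # mean of t under H1
    threshold = mean_alt / 2.0                           # assumed midpoint threshold
    type1 = gaussian_upper_tail(threshold / sigma)
    type2 = gaussian_upper_tail((mean_alt - threshold) / sigma)
    return threshold, type1, type2
```

With $m = n_1 n_2 = 64$, $k_1 = k_2 = 2$ and $\mu = \sigma = 1$, the mean under $H_1$ is $4$ and both error probabilities equal the standard normal tail at $2$.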

A.3. Proof of Theorem 3. Without loss of generality we assume $k_1 \le k_2$. 
Consider two distributions $\mathbb{P}_1$ and $\mathbb{P}_2$, where $\mathbb{P}_1$ is induced by the matrix $A_1$ 
when the activation block is $B = B_1 = [1, \ldots, k_1] \times [1, \ldots, k_2]$ and $\mathbb{P}_2$ is induced 
by the matrix $A_2$ when the activation block is $B = B_2 = [1, \ldots, k_1] \times [2, \ldots, k_2 + 1]$. 

Following the proof of Theorem 5, 

$$\begin{aligned}
KL(\mathbb{P}_1, \mathbb{P}_2) &= \mathbb{E}_{\mathbb{P}_1} \log \frac{d\mathbb{P}_1}{d\mathbb{P}_2} \\
&= \frac{1}{2\sigma^2} \sum_{i=1}^m \mathbb{E}_{\mathbb{P}_1} \big( \operatorname{tr}(A_2 X_i) - \operatorname{tr}(A_1 X_i) \big)^2 \\
&= \frac{\mu^2 m k_1}{\sigma^2 n_1 n_2},
\end{aligned}$$

using the fact that $X_i$ is a random Gaussian matrix with independent entries 
of variance $1/(n_1 n_2)$. 

The first part of the theorem now follows by noting that the minimax risk satisfies 

$$R \ge 1 - \sqrt{KL(\mathbb{P}_1, \mathbb{P}_2)/8}.$$

For the second part of the theorem, we consider $\mathbb{P}_2, \ldots, \mathbb{P}_{t+1}$, where 
$t = (n_1 - k_1)(n_2 - k_2)$, each of which is induced by a block $B$ that does not overlap 
with $B_1$. The same calculation now gives 

$$KL(\mathbb{P}_1, \mathbb{P}_j) \le \frac{\mu^2 m k_1 k_2}{\sigma^2 n_1 n_2}.$$

Now, applying the multiple hypothesis version of Fano's inequality (see 
Theorem 2.5 in Tsybakov, 2009) we arrive at the second part of the theorem. 
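
Both KL computations in this proof depend on the two blocks only through the size of their symmetric difference: with independent Gaussian entries of variance $1/(n_1 n_2)$, each measurement contributes $\mu^2 |B_1 \triangle B_2| / (2\sigma^2 n_1 n_2)$. A small sketch (zero-based indices and helper names are our own) makes this concrete.

```python
def block(rows, cols):
    """Set of (row, column) cells of a rectangular block."""
    return {(r, c) for r in rows for c in cols}

def kl_between_blocks(b1, b2, mu, sigma, m, n1, n2):
    """KL divergence for m passive Gaussian measurements with entry variance 1/(n1*n2)."""
    return mu ** 2 * m * len(b1 ^ b2) / (2.0 * sigma ** 2 * n1 * n2)
```

Shifting a $k_1 \times k_2$ block by one column changes $2k_1$ cells and gives $\mu^2 m k_1/(\sigma^2 n_1 n_2)$, while two disjoint blocks differ in $2 k_1 k_2$ cells and give $\mu^2 m k_1 k_2/(\sigma^2 n_1 n_2)$.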

A.4. Proof of Theorem 4. Let $z_{i,B} = \sum_{(a,b) \in B} X_{i,ab}$ and $Z_B = (z_{1,B}, \ldots, z_{m,B})'$. 
With this, we can write the loss function defined in Eq. (4.1) as 

$$f(B) := \min_{\mu} \| \mu Z_B - y \|_2^2. \tag{A.1}$$

Let $\Delta(B) = f(B) - f(B^*)$ and observe that an error is made if $\Delta(B) < 0$ 
for some $B \neq B^*$. Therefore, 

$$\mathbb{P}[\text{error}] = \mathbb{P}\big[ \cup_{B \in \mathcal{B} \setminus B^*} \{\Delta(B) < 0\} \big].$$

Under the conditions of the theorem, we will show that $\Delta(B) > 0$ for all 
$B \in \mathcal{B} \setminus B^*$ with large probability. 
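
The estimator analyzed here minimizes $f(B)$ over all candidate blocks, and the inner minimization over $\mu$ has the closed form $f(B) = \|y\|_2^2 - (Z_B' y)^2/\|Z_B\|_2^2$. The following brute-force sketch is only illustrative: the simulation setup, default parameters and function names are our assumptions, and it makes no attempt at efficiency.

```python
import random

def block_cells(r, c, k1, k2):
    """Cells of the k1 x k2 block whose top-left corner is (r, c), zero-based."""
    return [(r + a, c + b) for a in range(k1) for b in range(k2)]

def residual(z, y):
    """min over mu of ||mu * z - y||^2, using the closed-form least squares fit."""
    zz = sum(v * v for v in z)
    zy = sum(a * b for a, b in zip(z, y))
    return sum(v * v for v in y) - zy * zy / zz

def localize(X, y, n1, n2, k1, k2):
    """Return the top-left corner of the block minimizing the residual f(B)."""
    best = None
    for r in range(n1 - k1 + 1):
        for c in range(n2 - k2 + 1):
            cells = block_cells(r, c, k1, k2)
            z = [sum(Xi[a][b] for a, b in cells) for Xi in X]
            f = residual(z, y)
            if best is None or f < best[0]:
                best = (f, (r, c))
    return best[1]

def simulate(n1=6, n2=6, k1=2, k2=2, m=20, mu=1.0, sigma=0.0, corner=(2, 3), seed=0):
    """Draw Gaussian sensing matrices with entry variance 1/(n1*n2) and localize."""
    rng = random.Random(seed)
    sd = (n1 * n2) ** -0.5
    X = [[[rng.gauss(0.0, sd) for _ in range(n2)] for _ in range(n1)] for _ in range(m)]
    cells = block_cells(corner[0], corner[1], k1, k2)
    y = [mu * sum(Xi[a][b] for a, b in cells) + rng.gauss(0.0, sigma) for Xi in X]
    return localize(X, y, n1, n2, k1, k2)
```

In the noiseless case ($\sigma = 0$) the true block attains residual zero, so the search returns the planted corner; with noise, the proof quantifies how large $\mu$ must be for this to remain true with high probability.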

The following lemma shows that for any fixed $B$, the event $\{\Delta(B) < 0\}$ 
occurs with exponentially small probability. 

Lemma 7. Fix any $B \in \mathcal{B} \setminus B^*$. Then 

$$\mathbb{P}[\Delta(B) < 0] \le \exp\left( -c_1 \frac{(\mu^*)^2 m |B^* \setminus B|}{\sigma^2 n_1 n_2} \right) + c_2 \exp(-c_3 m). \tag{A.2}$$

Note that, under the assumptions of the theorem, the first term in Eq. (A.2) 
dominates the second term and hence will be put into the constant $c_1$. 

Define $N(l) = |\{B \in \mathcal{B} : |B \triangle B^*| = l\}|$ to be the number of elements in 
$\mathcal{B}$ whose symmetric difference with $B^*$ is equal to $l$. Note that $N(l) = 
O(1)$ for any $l$. Using the union bound 
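$$\begin{aligned}
\mathbb{P}\big[ \cup_{B \in \mathcal{B}} \{\Delta(B) < 0\} \big] 
&\le \sum_{B \in \mathcal{B},\, |B \triangle B^*| = 2 k_1 k_2} \exp\left( -c_1 \frac{\mu^2 k_1 k_2 m}{\sigma^2 n_1 n_2} \right) + \sum_{l < 2 k_1 k_2} N(l) \exp\left( -c_1 \frac{\mu^2 l m}{\sigma^2 n_1 n_2} \right) \\
&\le c_2 (n_1 - k_1)(n_2 - k_2) \exp\left( -c_1 \frac{\mu^2 k_1 k_2 m}{\sigma^2 n_1 n_2} \right) + c_3 k_1 k_2 \exp\left( -c_1 \frac{\mu^2 \min(k_1, k_2) m}{\sigma^2 n_1 n_2} \right).
\end{aligned} \tag{A.3}$$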
Choosing 

$$\mu \ge C_1 \sigma \sqrt{ \frac{n_1 n_2}{m} \log(2/\delta) \max\left( \frac{\log \max(k_1, k_2)}{\min(k_1, k_2)},\ \frac{\log \max(n_1 - k_1, n_2 - k_2)}{k_1 k_2} \right) },$$

each term in Eq. (A.3) will be smaller than $\delta/2$, with an appropriately chosen 
constant $C_1$. 

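We finish the proof of the theorem by proving Lemma 7. 

Proof of Lemma 7. For any $B \in \mathcal{B}$, let 

$$\hat{\mu}_B = \operatorname{argmin}_{\mu} \| \mu Z_B - y \|_2^2 = \|Z_B\|_2^{-2} Z_B' y.$$

Note that $\hat{\mu}_{B^*} = \mu^* + \|Z_{B^*}\|_2^{-2} Z_{B^*}' \epsilon$. 

Let 

$$H_B = \|Z_B\|_2^{-2} Z_B Z_B', \qquad H_B^\perp = I - \|Z_B\|_2^{-2} Z_B Z_B'$$

be the projection matrices and write 

$$f(B^*) = \| H_{B^*}^\perp \epsilon \|_2^2, \qquad f(B) = \| H_B^\perp (Z_{B^*} \mu^* + \epsilon) \|_2^2.$$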

Now, 

$$\Delta(B) = \underbrace{\| H_B^\perp \epsilon \|_2^2 - \| H_{B^*}^\perp \epsilon \|_2^2}_{T_1} + \underbrace{(\mu^*)^2 \| H_B^\perp Z_{B^*} \|_2^2 + 2 \epsilon' H_B^\perp Z_{B^*} \mu^*}_{T_2}.$$

Let $V_1, V_2 \sim \chi^2_{m-1}$. Observe that $T_1 \sim \sigma^2 (V_1 - V_2)$. 
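Hence 

$$\mathbb{P}\left[ |T_1| \ge \frac{\sigma^2 (m-1) \epsilon}{2} \right] \le 2\, \mathbb{P}\left[ |\chi^2_{m-1} - m + 1| \ge \frac{(m-1) \epsilon}{4} \right] \le 2 \exp\left( -\frac{3 (m-1) \epsilon^2}{256} \right), \tag{A.4}$$

using Eq. (B.4), as long as $\epsilon \in [0, 2)$. 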

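To analyze the term $T_2$, we condition on $X$, so that 

$$T_2 \mid X \sim N(\tilde{\mu}, 4 \sigma^2 \tilde{\mu}),$$

where $\tilde{\mu} = (\mu^*)^2 \| H_B^\perp Z_{B^*} \|_2^2$. This gives 

$$\mathbb{P}[T_2 \le \tilde{\mu}/2 \mid X] = \mathbb{P}\big[ N(0, 1) \ge \sqrt{\tilde{\mu}}/(4\sigma) \mid X \big].$$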






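Next, we show how to control $\| H_B^\perp Z_{B^*} \|_2^2$. Writing $Z_{B^*} = Z_B - Z_{B \setminus B^*} + 
Z_{B^* \setminus B}$, simple algebra gives 

$$\begin{aligned}
\| H_B^\perp Z_{B^*} \|_2^2 
&= \| H_B^\perp Z_{B^* \setminus B} \|_2^2 + \| H_B^\perp Z_{B \setminus B^*} \|_2^2 - 2 Z_{B^* \setminus B}' H_B^\perp Z_{B \setminus B^*} \\
&= \| H_B^\perp Z_{B^* \setminus B} \|_2^2 + \| Z_{B \setminus B^*} - Z_{B^* \setminus B} \|_2^2 - \| Z_{B^* \setminus B} \|_2^2 - \frac{\big( (Z_{B \setminus B^*} - Z_{B^* \setminus B})' Z_B \big)^2 - \big( Z_{B^* \setminus B}' Z_B \big)^2}{\| Z_B \|_2^2} \\
&\ge \| H_B^\perp Z_{B^* \setminus B} \|_2^2 + \| Z_{B \setminus B^*} - Z_{B^* \setminus B} \|_2^2 - \| Z_{B^* \setminus B} \|_2^2 - \frac{\big( (Z_{B \setminus B^*} - Z_{B^* \setminus B})' Z_B \big)^2}{\| Z_B \|_2^2}.
\end{aligned}$$

Define the event 

$$\begin{aligned}
\mathcal{E}(\epsilon) = {} & \left\{ \| H_B^\perp Z_{B^* \setminus B} \|_2^2 \ge \frac{(1 - \epsilon)(m - 1) |B^* \setminus B|}{n_1 n_2} \right\} 
\cap \left\{ \| Z_{B \setminus B^*} - Z_{B^* \setminus B} \|_2^2 \ge \frac{(1 - \epsilon)\, 2 m |B^* \setminus B|}{n_1 n_2} \right\} \\
& \cap \left\{ \| Z_{B^* \setminus B} \|_2^2 \le \frac{(1 + \epsilon) m |B^* \setminus B|}{n_1 n_2} \right\} 
\cap \left\{ \| Z_B \|_2^2 \ge \frac{(1 - \epsilon) m |B|}{n_1 n_2} \right\} \\
& \cap \left\{ \big| (Z_{B \setminus B^*} - Z_{B^* \setminus B})' Z_B \big| \le \frac{(1 + \epsilon) m |B^* \setminus B|}{n_1 n_2} \right\},
\end{aligned}$$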

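such that, using the concentration results in Appendix B, 

$$\mathbb{P}[\mathcal{E}(\epsilon)^C] \le c_1 \exp(-c_2 m \epsilon^2).$$

On the event $\mathcal{E}(\epsilon)$ we have that 

$$\| H_B^\perp Z_{B^*} \|_2^2 \ge \frac{m |B^* \setminus B|}{n_1 n_2} \left[ 3 (1 - \epsilon) - (1 + \epsilon) - \frac{(1 + \epsilon)^2}{1 - \epsilon} \frac{|B^* \setminus B|}{|B|} \right] \ge c_1 \frac{m |B^* \setminus B|}{n_1 n_2}.$$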



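Therefore, on the event $\mathcal{E}(\epsilon)$, 

$$\mathbb{P}[T_2 \le \tilde{\mu}/2 \mid X] \le \mathbb{P}\left[ N(0, 1) \ge c_1 \frac{\mu^*}{\sigma} \sqrt{\frac{m |B^* \setminus B|}{n_1 n_2}} \right],$$

and, accounting for the probability of $\mathcal{E}(\epsilon)^C$, 

$$\mathbb{P}[T_2 \le \tilde{\mu}/2] \le \exp\left( -c_1 \frac{(\mu^*)^2 m |B^* \setminus B|}{\sigma^2 n_1 n_2} \right) + c_2 \exp(-c_3 m \epsilon^2). \tag{A.5}$$

Combining Eq. (A.4) and Eq. (A.5) completes the proof. □ 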






A.5. Proof of Theorem 5. The proof will proceed via two separate 
constructions. At a high level these constructions are intended to capture 
the difficulty of exactly and approximately localizing the activation block. 

Construction 1 - approximate localization: Let us define three 
distributions: $\mathbb{P}_0$ corresponding to no bicluster, $\mathbb{P}_1$ which is a uniform mixture 
over the distributions induced by having the top-left corner of the bicluster 
in the left half of the matrix, and $\mathbb{P}_2$ which is a uniform mixture over the 
distributions induced by having the top-left corner of the bicluster in the 
right half of the matrix. 

We first upper bound the total variation between $\mathbb{P}_1$ and $\mathbb{P}_2$. This results 
directly in a lower bound for the problem of distinguishing whether the 
top-left corner of the bicluster is in the left or right half of the matrix, which in 
turn is a lower bound for the localization of the bicluster. 

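Now notice that 

$$\| \mathbb{P}_1 - \mathbb{P}_2 \|_{TV} \le \| \mathbb{P}_0 - \mathbb{P}_1 \|_{TV} + \| \mathbb{P}_0 - \mathbb{P}_2 \|_{TV} \le \sqrt{KL(\mathbb{P}_0, \mathbb{P}_1)/2} + \sqrt{KL(\mathbb{P}_0, \mathbb{P}_2)/2}.$$

Notice that $KL(\mathbb{P}_0, \mathbb{P}_1)$ is exactly the quantity we have to upper bound to 
produce a lower bound on the signal strength for detecting whether there is 
a bicluster in the left half of the matrix or not. At least from a lower bound 
perspective this reduces the problem of localization to that of detection. We 
can now apply a slight modification of the proof of Theorem 1 to obtain that 

$$KL(\mathbb{P}_0, \mathbb{P}_1) = KL(\mathbb{P}_0, \mathbb{P}_2) \le \frac{m \mu^2 k_1^2 k_2^2}{\sigma^2 (n_1 - k_1)(n_2/2 - k_2)}.$$

It follows that the minimax risk $R$ for distinguishing $\mathbb{P}_1$ from $\mathbb{P}_2$ satisfies 

$$R = 1 - \frac{1}{2} \| \mathbb{P}_1 - \mathbb{P}_2 \|_{TV} \ge 1 - \sqrt{\frac{m \mu^2 k_1^2 k_2^2}{2 \sigma^2 (n_1 - k_1)(n_2/2 - k_2)}}.$$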

Construction 2 - exact localization: Without loss of generality we 
assume $k_1 \le k_2$. Consider two distributions $\mathbb{P}_1$ and $\mathbb{P}_2$, where $\mathbb{P}_1$ is induced 
by the matrix $A_1$ when the activation block is $B = B_1 = [1, \ldots, k_1] \times [1, \ldots, k_2]$ 
and $\mathbb{P}_2$ is induced by the matrix $A_2$ when the activation block is $B = B_2 = 
[1, \ldots, k_1] \times [2, \ldots, k_2 + 1]$. 
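Now, following the same argument as in the proof of Theorem 1, we have 

$$\begin{aligned}
KL(\mathbb{P}_1, \mathbb{P}_2) &= \mathbb{E}_{\mathbb{P}_1} \sum_{i=1}^m \left( -\frac{1}{2\sigma^2} \big[ (y_i - \operatorname{tr}(A_1 X_i))^2 - (y_i - \operatorname{tr}(A_2 X_i))^2 \big] \right) \\
&= \sum_{i=1}^m \mathbb{E}_{\mathbb{P}_1} \frac{1}{2\sigma^2} \big[ \operatorname{tr}(A_2 X_i)^2 - \operatorname{tr}(A_1 X_i)^2 + 2 y_i \operatorname{tr}(A_1 X_i) - 2 y_i \operatorname{tr}(A_2 X_i) \big] \\
&= \sum_{i=1}^m \mathbb{E}_{\mathbb{P}_1} \frac{1}{2\sigma^2} \big( \operatorname{tr}(A_1 X_i) - \operatorname{tr}(A_2 X_i) \big)^2 = \frac{1}{2\sigma^2} \sum_{i=1}^m \mathbb{E}_{\mathbb{P}_1} t_i^2.
\end{aligned}$$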

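Now, with some abuse of notation, 

$$t_i = \mu \left( \sum_{j \in B_1 \setminus B_2} X_{ij} - \sum_{j \in B_2 \setminus B_1} X_{ij} \right).$$

By using Cauchy-Schwarz we get 

$$t_i^2 \le 2 \mu^2 k_1 \sum_{j \in B_1 \triangle B_2} X_{ij}^2 \le 2 \mu^2 k_1,$$

since $\| X_i \|_F^2 = 1$. 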
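
The Cauchy-Schwarz step holds for every sensing matrix of unit Frobenius norm, adaptive or not, which is what makes the bound valid for adaptive schemes. The sketch below checks $t_i^2 \le 2\mu^2 k_1$ on a randomly drawn, normalized matrix; the random draw, zero-based indexing and the function name are our own illustrative choices.

```python
import random

def t_squared_and_bound(k1, k2, n1, n2, mu, seed=0):
    """Return (t^2, 2*mu^2*k1) for one random sensing matrix with ||X||_F = 1."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n2)] for _ in range(n1)]
    fro = sum(v * v for row in X for v in row) ** 0.5
    X = [[v / fro for v in row] for row in X]
    # B1 covers columns 0..k2-1 and B2 covers columns 1..k2 (same k1 rows),
    # so B1 \ B2 is column 0 and B2 \ B1 is column k2.
    only_b1 = sum(X[r][0] for r in range(k1))
    only_b2 = sum(X[r][k2] for r in range(k1))
    t = mu * (only_b1 - only_b2)
    return t * t, 2.0 * mu ** 2 * k1
```

The symmetric difference contains $2k_1$ cells, so the squared sum of $2k_1$ signed entries is at most $2k_1$ times their sum of squares, which is itself at most $\|X_i\|_F^2 = 1$.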



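This gives us that 

$$KL(\mathbb{P}_1, \mathbb{P}_2) \le \frac{m k_1 \mu^2}{\sigma^2}.$$

Together with a similar construction for the case when $k_2 < k_1$ we get 

$$KL(\mathbb{P}_1, \mathbb{P}_2) \le \frac{m \min(k_1, k_2) \mu^2}{\sigma^2}.$$

Once again noting (by Pinsker's theorem), 

$$R \ge 1 - \sqrt{KL(\mathbb{P}_1, \mathbb{P}_2)/8} \ge 1 - \sqrt{\frac{m \min(k_1, k_2) \mu^2}{8 \sigma^2}}.$$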
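Combining the approximate and exact localization bounds we get 

$$R \ge \max\left\{ 1 - \sqrt{\frac{m \min(k_1, k_2) \mu^2}{8 \sigma^2}},\ 1 - \sqrt{\frac{m \mu^2 k_1^2 k_2^2}{2 \sigma^2 (n_1 - k_1)(n_2/2 - k_2)}} \right\}.$$

Thus, we get for any $0 < \alpha < 1$, $R \ge \alpha$ if 

$$\min\left( \sqrt{\frac{m \min(k_1, k_2) \mu^2}{8 \sigma^2}},\ \sqrt{\frac{m \mu^2 k_1^2 k_2^2}{2 \sigma^2 (n_1 - k_1)(n_2/2 - k_2)}} \right) \le 1 - \alpha.$$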

A.6. Proof of Theorem 6. As with the lower bound, the localization 
algorithm and its analysis are naturally divided into two phases: an approximate 
localization phase and an exact localization phase. We will analyze each of 
these in turn. To ease presentation we will assume $n_1$ is a dyadic multiple 
of $2k_1$ and $n_2$ a dyadic multiple of $2k_2$. Straightforward modifications are 
possible when this is not the case. 

Approximate localization: The approximate localization phase pro- 
ceeds by a modification of the compressive binary search (CBS) procedure 
of Malloy and Nowak (2012) (see also Davenport and Arias-Castro (2012)) 
on the matrix A. 

We will run this modified CBS procedure twice on two sets of blocks of 
the matrix $A$. The first set consists of the blocks 

$$\mathcal{D}_1 = \big\{ B_{11} = [1, \ldots, 2k_1] \times [1, \ldots, 2k_2],\ B_{12} = [2k_1 + 1, \ldots, 4k_1] \times [1, \ldots, 2k_2],\ \ldots,\ B_{1, n_1 n_2/(4 k_1 k_2)} = [n_1 - 2k_1, \ldots, n_1] \times [n_2 - 2k_2, \ldots, n_2] \big\}.$$

The second set consists of the blocks 

$$\mathcal{D}_2 = \big\{ B_{21} = [k_1, \ldots, 3k_1] \times [k_2, \ldots, 3k_2],\ B_{22} = [3k_1 + 1, \ldots, 5k_1] \times [k_2, \ldots, 3k_2],\ \ldots,\ B_{2, n_1 n_2/(4 k_1 k_2)} = [n_1 - k_1, \ldots, n_1, 1, \ldots, k_1] \times [n_2 - k_2, \ldots, n_2, 1, \ldots, k_2] \big\}.$$

Notice that the entire block of activation is always fully contained in one 
of these blocks. The output of the CBS procedure when run on these two 
collections is two blocks - one from the first collection and the second from 
the second collection. We define an approximate localization error to be the 
event in which neither of the two blocks returned fully contains the block of 
activation. 

Without loss of generality let us assume that the activation block is fully 
contained in some block from the first collection. Once we have fixed the 
collection of blocks, the CBS procedure is invariant to reordering of the 
blocks, so without loss of generality we can consider the case when the 
activation block is contained in $B_{11}$. 

The analysis proceeds exactly as in Malloy and Nowak (2012); we detail only 
the differences arising from having a block of activation as opposed to a 
single activation in a vector. Notice that the binary search procedure on 
the first collection of blocks proceeds for 

$$s_0 = \log_2 \frac{n_1 n_2}{4 k_1 k_2}$$

rounds. Now, we can bound the probability of error of the procedure by a 
union bound as 

$$P_e \le \sum_{s=1}^{s_0} \mathbb{P}[w_s < 0],$$

where 

$$w_s \sim N\left( \frac{m_s 2^{(s-1)/2} k_1 k_2 \mu}{\sqrt{n_1 n_2}},\ m_s \sigma^2 \right).$$

Recall the allocation scheme: for $m \ge 2 s_0$, $m_s = \lfloor (m - s_0)\, s\, 2^{-s-1} \rfloor + 1$, 
and observe that $\sum_{s=1}^{s_0} m_s \le m$. 
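
The allocation $m_s = \lfloor (m - s_0)\, s\, 2^{-s-1} \rfloor + 1$ is easy to tabulate, which makes the budget claim $\sum_s m_s \le m$ concrete. The sketch below is ours (the function name `cbs_allocation` is hypothetical); it simply evaluates the formula for each round $s = 1, \ldots, s_0$.

```python
import math

def cbs_allocation(m, s0):
    """Measurements per round: m_s = floor((m - s0) * s * 2^(-s-1)) + 1."""
    if m < 2 * s0:
        raise ValueError("the allocation assumes m >= 2 * s0")
    return [math.floor((m - s0) * s * 2.0 ** (-s - 1)) + 1 for s in range(1, s0 + 1)]
```

The budget is respected because $\sum_{s \ge 1} s\, 2^{-s-1} = 1$, so the floors sum to at most $m - s_0$ and the added ones contribute $s_0$. Since $\lfloor x \rfloor + 1 > x$ and $m - s_0 \ge m/2$, each round also receives at least $m s 2^{-s-2}$ measurements.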

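Now, using the Gaussian tail bound 

$$\mathbb{P}[N(0, 1) > t] \le \frac{1}{2} \exp(-t^2/2)$$

we see that 

$$P_e \le \frac{1}{2} \sum_{s=1}^{s_0} \exp\left( -\frac{m_s 2^s k_1^2 k_2^2 \mu^2}{4 n_1 n_2 \sigma^2} \right).$$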

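Now, observe that $m_s \ge (m - s_0)\, s\, 2^{-s-1}$ and $m \ge 2 s_0$, so $m_s \ge m s 2^{-s-2}$. 
It is now straightforward to verify that if 

$$\mu \ge \sqrt{ \frac{16 \sigma^2 n_1 n_2}{m k_1^2 k_2^2} \log\left( \frac{1}{2\delta} + 1 \right) }$$

we have $P_e \le \delta$. 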
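
The last step is a geometric series: with $m_s \ge m s 2^{-s-2}$ the union bound is at most $\frac{1}{2}\sum_{s \ge 1} e^{-sa}$ with $a = m k_1^2 k_2^2 \mu^2/(16 n_1 n_2 \sigma^2)$, and this is at most $\delta$ once $a \ge \log(1/(2\delta) + 1)$. The sketch below evaluates both sides; the threshold is our own derivation from the displayed inequalities (it agrees with the constants $192 = 12 \cdot 16$ and $3/\delta$ after the final rescaling), and the function names are hypothetical.

```python
import math

def cbs_error_bound(mu, sigma, m, n1, n2, k1, k2, s0):
    """Union bound 0.5 * sum_s exp(-s * a), with a = m k1^2 k2^2 mu^2 / (16 n1 n2 sigma^2)."""
    a = m * k1 ** 2 * k2 ** 2 * mu ** 2 / (16.0 * n1 * n2 * sigma ** 2)
    return 0.5 * sum(math.exp(-s * a) for s in range(1, s0 + 1))

def cbs_sufficient_mu(delta, sigma, m, n1, n2, k1, k2):
    """Signal strength at which the geometric-series bound equals delta."""
    return math.sqrt(
        16.0 * sigma ** 2 * n1 * n2 * math.log(1.0 / (2.0 * delta) + 1.0)
        / (m * k1 ** 2 * k2 ** 2)
    )
```

For example, with $n_1 = n_2 = 64$ and $k_1 = k_2 = 4$ there are $s_0 = 6$ rounds, and at the returned signal strength the finite sum stays below $\delta$.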

Let us revisit what we have shown so far: if $\mu$ is large enough then one 
of the two runs of the CBS procedure will return a block of size $(2k_1 \times 2k_2)$ 
which fully contains the block of activation, with probability at least $1 - 2\delta$. 

Exact localization: In the $1 - \delta$ probability event described above, we 
have a block of size at most $(4k_1 \times 4k_2)$ which contains the full block of activation 
(for simplicity we disregard the fact that we know that the block is actually 
in one of two $(2k_1 \times 2k_2)$ blocks). 

Let us first identify the active columns. First, notice that one of the first, 
$(k_2 + 1)$st, $(2k_2 + 1)$st or $(3k_2 + 1)$st columns must be active. Let us devote $4m$ 
measurements to identifying the active column amongst these. The procedure is 
straightforward: measure each column $m$ times, and pick the largest. 

It is easy to show that the active column results in a draw from 
$N(\sqrt{k_1}\, \mu m/2,\ m \sigma^2)$ 
and the non-active columns result in draws from $N(0, m \sigma^2)$. 

Using the same Gaussian tail bound as before it is easy to show that if 



we successfully find the active column with probability at least $1 - \delta$. 

So far, we have identified an active column and localized the columns 
of the activation block to one of $2k_2$ columns. We will use $m$ more 
measurements to find the remaining active columns. Rather than test each of 
the $2k_2$ columns we will do a binary search. This will require us to test at 
most $t = 2 \lceil \log k_2 \rceil \le 3 \log k_2$ columns, and we will devote $m/(3 \log k_2)$ 
measurements to each column. We will need to threshold these measurements 
at 



and declare a column as active if its average is larger than this. 

It is easy to show that this binary search procedure successfully finds all 
active columns with probability at least $1 - \delta$ if 



We repeat this procedure to identify the active rows. 
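
The search for the remaining active columns can be sketched as two binary searches around the known active column, one for each end of the contiguous run. The code below is a noiseless illustration under our own assumptions: `is_active` stands in for the thresholded average of the measurements on one column, indices are arbitrary integers, and the function name is hypothetical.

```python
def find_active_interval(is_active, known, k):
    """Locate a contiguous run of k active indices containing `known`.

    Returns (left, right, queries), where queries counts calls to is_active.
    """
    queries = 0
    lo, hi = known - k + 1, known          # leftmost active index lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if is_active(mid):
            hi = mid
        else:
            lo = mid + 1
    left = lo
    lo, hi = known, known + k - 1          # rightmost active index lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi + 1) // 2
        queries += 1
        if is_active(mid):
            lo = mid
        else:
            hi = mid - 1
    return left, lo, queries
```

Each search halves a range of $k$ candidates, so at most $2\lceil \log_2 k \rceil$ columns are tested in total, in line with the count $t$ used above. In the noisy setting every call to `is_active` would consume its share of the measurement budget.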

Putting everything together: Total number of measurements used: 

1. Two rounds of CBS: 2m 

2. Identifying first active column and first active row: 8m 

3. Identifying remaining active rows and columns: 2m 

This is a total of $12m$ measurements. Each of these steps fails with 
probability at most $\delta$, for a total of $6\delta$. 
Now, re-adjusting constants, we obtain that if 

$$\mu \ge \max\left( \sqrt{ \frac{192 \sigma^2 n_1 n_2}{m k_1^2 k_2^2} \log\left( \frac{3}{\delta} + 1 \right) },\ \sqrt{ \frac{384 \sigma^2 \log \max(k_1, k_2)}{m \min(k_1, k_2)} \log \frac{18 \log \max(k_1, k_2)}{\delta} } \right)$$

then we successfully localize the activation block with probability at least $1 - \delta$. 
Stated more succinctly we require 

$$\mu \ge \widetilde{O}\left( \max\left( \sqrt{\frac{\sigma^2 n_1 n_2}{m k_1^2 k_2^2}},\ \sqrt{\frac{\sigma^2}{\min(k_1, k_2)\, m}} \right) \right).$$

This matches the lower bound up to $\log k$ factors. 

APPENDIX B: COLLECTION OF CONCENTRATION RESULTS 

In this section, we collect useful results on tail bounds of various random 
quantities used throughout the paper. We start by stating a lower and upper 
bound on the survival function of the standard normal random variable. Let 
$Z \sim N(0, 1)$ be a standard normal random variable. Then for $t > 0$ 

$$\frac{1}{\sqrt{2\pi}} \frac{t}{t^2 + 1} \exp(-t^2/2) \le \mathbb{P}(Z > t) \le \frac{1}{\sqrt{2\pi}} \frac{1}{t} \exp(-t^2/2). \tag{B.1}$$
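
The two-sided bound (B.1) can be compared against the exact survival function, which the Python standard library exposes through the complementary error function. This is a minimal numerical check; the function names are ours.

```python
import math

def normal_survival(t):
    """Exact P[Z > t] for a standard normal Z."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def tail_bounds(t):
    """Lower and upper bounds from (B.1), valid for t > 0."""
    phi = math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)
    return phi * t / (t * t + 1.0), phi / t
```

Both bounds become tight as $t$ grows, since their ratio is $t^2/(t^2 + 1)$.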

B.1. Tail bounds for Chi-squared variables. Throughout the paper 
we will often use one of the following tail bounds for central $\chi^2$ random 
variables. These are well known and proofs can be found in the original 
papers. 

Lemma 8 (Laurent and Massart (2000)). Let $X \sim \chi^2_d$. For all $x \ge 0$, 

$$\mathbb{P}[X - d \ge 2\sqrt{dx} + 2x] \le \exp(-x) \tag{B.2}$$

$$\mathbb{P}[X - d \le -2\sqrt{dx}] \le \exp(-x). \tag{B.3}$$
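
A quick Monte Carlo experiment illustrates how conservative (B.2) and (B.3) are for moderate $d$. The sketch below draws $\chi^2_d$ variables as sums of squared standard normals; the sample size, seed and function name are arbitrary choices of ours.

```python
import math
import random

def chi2_tail_frequencies(d, x, trials=20000, seed=0):
    """Empirical frequencies of the events in (B.2) and (B.3), plus the bound exp(-x)."""
    rng = random.Random(seed)
    upper = d + 2.0 * math.sqrt(d * x) + 2.0 * x
    lower = d - 2.0 * math.sqrt(d * x)
    above = below = 0
    for _ in range(trials):
        v = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d))
        above += v >= upper
        below += v <= lower
    return above / trials, below / trials, math.exp(-x)
```

For $d = 10$ and $x = 2$, both empirical frequencies fall well below the bound $e^{-2} \approx 0.135$.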

Lemma 9 (Johnstone and Lu (2009)). Let $X \sim \chi^2_d$, then 

$$\mathbb{P}[|d^{-1} X - 1| \ge x] \le \exp\left( -\tfrac{3}{16} d x^2 \right), \quad x \in [0, \tfrac{1}{2}). \tag{B.4}$$

The following result provides a tail bound for a non-central $\chi^2$ random 
variable with non-centrality parameter $\nu$. 

Lemma 10 (Birge (2001)). Let $X \sim \chi^2_d(\nu)$, then for all $x > 0$ 

$$\mathbb{P}[X \ge (d + \nu) + 2\sqrt{(d + 2\nu) x} + 2x] \le \exp(-x) \tag{B.5}$$

$$\mathbb{P}[X \le (d + \nu) - 2\sqrt{(d + 2\nu) x}] \le \exp(-x). \tag{B.6}$$

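Using the above results, we have a tail bound for sums of products of normal 
random variables. 

Lemma 11. Let $Z = (Z_a, Z_b) \sim N_2(0, 0, \sigma_{aa}, \sigma_{bb}, \sigma_{ab})$ be a bivariate 
Normal random variable and let $(z_{ia}, z_{ib}) \overset{iid}{\sim} Z$, $i = 1, \ldots, n$. Then for all 
$t \in [0, \nu_{ab}/2)$ 

$$\mathbb{P}\left[ \Big| \frac{1}{n} \sum_i z_{ia} z_{ib} - \sigma_{ab} \Big| \ge t \right] \le 4 \exp\left( -\frac{3 n t^2}{16 \nu_{ab}^2} \right), \tag{B.7}$$

where $\nu_{ab} = \max\{ (1 - \rho_{ab}) \sqrt{\sigma_{aa} \sigma_{bb}},\ (1 + \rho_{ab}) \sqrt{\sigma_{aa} \sigma_{bb}} \}$. 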
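Proof. Let $z'_{ia} = z_{ia}/\sqrt{\sigma_{aa}}$ and $z'_{ib} = z_{ib}/\sqrt{\sigma_{bb}}$. Then using (B.4) 

$$\begin{aligned}
\mathbb{P}\left[ \Big| \frac{1}{n} \sum_{i=1}^n z_{ia} z_{ib} - \sigma_{ab} \Big| \ge t \right] 
&= \mathbb{P}\left[ \Big| \frac{1}{n} \sum_{i=1}^n z'_{ia} z'_{ib} - \rho_{ab} \Big| \ge \frac{t}{\sqrt{\sigma_{aa} \sigma_{bb}}} \right] \\
&= \mathbb{P}\left[ \Big| \sum_{i=1}^n \big( (z'_{ia} + z'_{ib})^2 - 2 (1 + \rho_{ab}) \big) - \big( (z'_{ia} - z'_{ib})^2 - 2 (1 - \rho_{ab}) \big) \Big| \ge \frac{4 n t}{\sqrt{\sigma_{aa} \sigma_{bb}}} \right] \\
&\le \mathbb{P}\left[ \Big| \sum_{i=1}^n \big( (z'_{ia} + z'_{ib})^2 - 2 (1 + \rho_{ab}) \big) \Big| \ge \frac{2 n t}{\sqrt{\sigma_{aa} \sigma_{bb}}} \right] 
+ \mathbb{P}\left[ \Big| \sum_{i=1}^n \big( (z'_{ia} - z'_{ib})^2 - 2 (1 - \rho_{ab}) \big) \Big| \ge \frac{2 n t}{\sqrt{\sigma_{aa} \sigma_{bb}}} \right] \\
&\le 2\, \mathbb{P}\left[ |\chi^2_n - n| \ge \frac{n t}{\nu_{ab}} \right] \le 4 \exp\left( -\frac{3 n t^2}{16 \nu_{ab}^2} \right),
\end{aligned}$$

where $\nu_{ab} = \max\{ (1 - \rho_{ab}) \sqrt{\sigma_{aa} \sigma_{bb}},\ (1 + \rho_{ab}) \sqrt{\sigma_{aa} \sigma_{bb}} \}$ and $t \in [0, \nu_{ab}/2)$. □ 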

Corollary 12. Let $Z_1$ and $Z_2$ be two independent standard Normal 
random variables and let $X_i \overset{iid}{\sim} Z_1 Z_2$, $i = 1, \ldots, n$. Then for $t \in [0, 1/2)$ 



(B. 



In" 1 x i\ > t] < 4exp( 

ie[n] 



3nr 